The identification of communities in large-scale networks is a
challenging task for existing search schemes when the communities
overlap significantly among their members, as is often the case in
large-scale social networks. The strong overlaps render many
algorithms invalid. We propose a detection scheme based on properly
merging the partial communities revealed by the ego networks of each
vertex. The general principle, merger criteria, and post-processing
procedures are discussed. This partial community merger algorithm
(PCMA) is tested on two modern benchmark models. It shows a linear
time complexity and performs accurately and efficiently when compared
with two widely used algorithms. PCMA is then applied to a huge social
network and millions of communities are identified. A detected
community can be visualized with all its members as well as the number
of different communities that each member belongs to. The multiple
memberships of a vertex, in turn, illustrate the significant overlaps
between communities that call for a novel and efficient algorithm such
as PCMA.

PACS numbers: 89.75.Fb, 89.65.Ef, 89.75.Hc, 89.20.Ff 


I. INTRODUCTION 

Community structure is commonly found in networked systems in nature
and society. While it is almost common sense to realize the existence
of communities, extracting such mesoscopic structures efficiently and
accurately remains a challenging task, and yet it is crucial to the
understanding of the functionality of these systems. Although the
definition of community remains ambiguous and a commonly accepted
definition is lacking, many detection algorithms have been developed
in the past decade, with most of them based on the concept that there
should be more edges within the community than edges connecting to the
outside. This viewpoint on what a community is about implies that
communities are disjoint, and it is behind the design of
non-overlapping community detection algorithms. However, it was soon
found by empirical studies that it is common for communities to
overlap, i.e. each vertex may have multiple memberships and thus may
be shared by communities. Several approaches have been proposed for
detecting overlapping communities, including clique percolation, link
partitioning, local expansion and optimization [9]-[11], and label
propagation [12]-[14]. These methods perform well when overlapping
vertices constitute a small portion of the network, but most of them
fail to detect communities when the overlaps are significant. The
reason is that most of these algorithms are still based on the
assumption that a community should have more internal than external
edges and thus the communities are only slightly overlapping, which is
invalid when the vertices have multiple memberships. For example, if
all members of a community have two or more memberships, it is highly
likely that the community has more edges connecting to other
communities than internal edges. A new concept of community is
therefore needed for cases of significant overlaps. We adopt an
intuitive idea that each member of a community should be connected to
a certain fraction of the other members. The idea is similar to
k-core, but it is fraction-based and we call it f-core. Unlike k-core,
an f-core allows its members to belong to multiple f-cores and thus it
is suitable for describing cases in which a member could belong to
many communities. Yet, an f-core is not necessarily a single connected
component and could sometimes be two or more separate clusters. To
make it a useful concept in the context of communities with
significant overlaps, a further constraint will prove effective: a
community has members who are densely connected to each other and thus
a relatively high value of clustering coefficient.
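The f-core idea can be made concrete with a short check. The following is a minimal sketch, not the authors' implementation: the function name `is_f_core`, the helper `clique`, and the adjacency-dictionary representation are assumptions introduced here for illustration.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def is_f_core(adj: Graph, members: Set[Hashable], f: float) -> bool:
    """True if every member is linked to at least a fraction f of the other members."""
    if len(members) < 2:
        return False
    others = len(members) - 1
    return all(len(adj[v] & members) >= f * others for v in members)


def clique(nodes):
    """Helper for the example: adjacency of a complete graph on `nodes`."""
    return {v: set(nodes) - {v} for v in nodes}


# Two cliques sharing vertices 2 and 3, so these two vertices have two memberships.
adj = clique([0, 1, 2, 3])
for v, nbrs in clique([2, 3, 4, 5]).items():
    adj.setdefault(v, set()).update(nbrs)

print(is_f_core(adj, {0, 1, 2, 3}, f=0.8))        # each group passes on its own
print(is_f_core(adj, {2, 3, 4, 5}, f=0.8))
print(is_f_core(adj, {0, 1, 2, 3, 4, 5}, f=0.8))  # the union of the two does not
```

In this toy case vertices 2 and 3 satisfy the fraction criterion in both groups, which is the multiple-membership situation that a k-core description cannot express.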

Structurally, communities with significant overlaps are hidden under
dense and messy edges, unlike the cases of disjoint and slightly
overlapping communities. Identifying such communities is highly
non-trivial from a global or top-down viewpoint. This motivated us to
approach the problem from a local or bottom-up viewpoint. Starting
locally from a vertex, it is easier to identify which groups a vertex
belongs to in the subnetwork consisting of the vertex itself and its
neighbors, i.e. the ego network of a vertex, as illustrated in Fig. 1
using data from an online social network. The local view gives a
natural sample of vertices and edges that allows us to visualize the
hidden community structure clearly though partially. It is partial
because the ego network only reveals part of each of the different
communities that a vertex belongs to. The idea is then to replace a
very difficult task from global approaches by many easier tasks of
finding community structures from a local approach and to construct a
proper way to aggregate the results. This inspired the present work.
Here, we propose a novel and efficient approach based on local views
of the vertices for detecting communities with significant overlaps
that works in linear time. Two steps are involved: detecting
communities locally for each vertex and merging similar ones to
recover the complete communities. Our method has many advantages. As
in other algorithms based on a local approach, the method does not
require a pre-set number of communities to search for as input, and it
has a linear time complexity for sparse networks. Most importantly,
the assumption of disjoint or slightly overlapping communities is
abandoned and the method is designed to handle the possibility of
multiple memberships of a vertex and detect significantly overlapping
communities. Our method also allows vertices to be homeless, i.e.,
they do not belong to any communities. This feature is very important
in dealing with real-world networks, and yet very few algorithms have
considered this possibility. Community detection algorithms must be
able to distinguish real communities from pseudo communities. We are
well aware of the issue and our method sifts out real communities by
applying proper thresholds. All these advantages make the method
uniquely capable of detecting communities with significant overlaps
efficiently in large-scale real networks with hundreds of millions of
vertices.

The plan of the paper is as follows. In Sec. II we introduce the
details of the algorithm, including similarity measure, merger of
similar communities, thresholds, and applicability. In Sec. III the
method is tested against two benchmarks and its performance in
accuracy and efficiency is compared with two other recently proposed
algorithms. In Sec. IV we apply the method to a large empirical data
set and show that significantly overlapping communities are common in
social networks. Results are summarized in Sec. V.



FIG. 1: (Color online) An ego network of a vertex provides the local
information and reveals several partial communities. The network was
constructed from data collected from Sina Weibo, an online social
network akin to a hybrid of Facebook and Twitter. The partial
communities are found by an existing algorithm as described in
Appendix A. PCMA is an efficient and accurate algorithm for detecting
complete communities in a huge network by properly merging partial
communities revealed by the ego networks of all the vertices.


II. PARTIAL COMMUNITY MERGER ALGORITHM 
A. General Principle 

We aim to detect communities in a network in which vertices could span
from being homeless to belonging to multiple communities. This renders
many top-down algorithms invalid. We first give a physical picture of
our algorithm. Consider a community in which every member is connected
to a certain fraction of the other members. At the local level, the
members only know their own neighbors and have no knowledge of the
complete community. They are given the task of compiling a roster of
the community and identifying who the core members are. To complete
the task, each member shares its local information consisting of a
name list including itself and all its neighbors. A complete roster
can in principle be derived by merging these individual name lists
skillfully. Those who appear frequently on the lists are the core
members, while those with fewer occurrences are on the periphery of
the community. This merger process is the core idea of our method of
detecting communities.

Practically, we start with exploring the ego network of a vertex, i.e.
the subnetwork consisting of the vertex itself and its neighbors, and
identifying the communities hidden in it. This is illustrated in
Fig. 1 for a vertex (the central one) in an online social network.
This local view lets us see the communities clearly. Since a vertex
may not know all the members in each of its communities, the
information on the communities found based on the ego network of a
vertex is incomplete. We refer to them as the partial communities from
the viewpoint of the vertex. This process can be carried out for every
vertex. Although each member only helps reveal part of the whole
picture, the idea is that aggregating local information should reveal
the complete communities, i.e., every community with all its members.
With the partial communities revealed by different vertices, we need
to determine which ones are actually different parts of the same
community. Merger of these partial communities in a proper way gives a
complete community. It is a technically difficult task, as vertices
may be misclassified into partial communities that they actually do
not belong to. The merger process therefore leads to much noise in the
merged communities. A cleaning process or post-processing scheme is
then invoked to eliminate the misclassified vertices and sift out the
real and complete communities.
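The local view described above can be sketched in a few lines. This is an illustration only: `ego_network` follows the definition in the text, while `crude_partial_communities` is a deliberately simple stand-in (connected components among the neighbors once the ego is removed) and is not the detection method used in the paper. Both function names are assumptions.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]


def ego_network(adj: Graph, ego: Hashable) -> Graph:
    """Subnetwork induced by `ego` and its neighbors."""
    nodes = {ego} | adj[ego]
    return {v: adj[v] & nodes for v in nodes}


def crude_partial_communities(adj: Graph, ego: Hashable) -> List[Set[Hashable]]:
    """Stand-in for Step 1: group neighbors by connectivity with the ego removed."""
    ego_net = ego_network(adj, ego)
    unseen = set(ego_net) - {ego}
    groups = []
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            v = stack.pop()
            for u in ego_net[v] - {ego}:
                if u in unseen:
                    unseen.remove(u)
                    component.add(u)
                    stack.append(u)
        groups.append(component | {ego})  # the ego belongs to each of its groups
    return groups


# Vertex 0 knows two separate circles {1, 2} and {3, 4}; vertex 5 is outside its view.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3, 5}, 5: {4}}
print(crude_partial_communities(adj, 0))
```

Vertex 5 belongs to the same circle as vertex 4 but is invisible from vertex 0, which is why the groups found locally are only partial.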

Our method thus consists of three steps: 

1. Find the partial communities in the ego network of each 
of the vertices. 

2. Merge partial communities that are parts of the same 
community to reconstruct complete communities. 

3. Clean the merged communities to sift out real commu¬ 
nities. 

For easy reference, we call the method the Partial Community Merger
Algorithm, or PCMA for short. It is a general approach that can be
implemented in different ways. For Step 1, many existing algorithms
are available and we use the one proposed by Ball et al., with details
given in Appendix A. The new elements are Step 2 and Step 3. Below, we
introduce our implementation of Steps 2 and 3 in detail.


B. Merger 


The merger process aims to determine whether two partial communities
are part of the same complete community. This is a classic clustering
problem. We thus extend an idea from agglomerative hierarchical
clustering to the present purpose: start with a set of partial
communities, merge the most similar pair of communities into one, and
repeat the merger until the similarity is below a threshold.

Care must be taken in choosing a suitable similarity measure between
communities, each of which is represented by a set of vertices. The
Jaccard index is a common similarity measure. It is defined to be the
size of the intersection divided by the size of the union of two sets.
A drawback of the index is that the members are assumed equal. A
merged community, however, contains core members, peripheral members,
and even misclassified ones. They should be treated differently. Here
we propose a novel similarity measure that incorporates the different
importance of members.

For a Vertex $i$ in Community $C$, let $S_{i,C}$ be a score that
represents its importance in $C$. Without loss of generality, we
define $S_{i,C} = 0$ if $i \notin C$. Let $l_C$ be the number of
partial communities that have merged to form Community $C$. Before the
merger, the partial communities identified by Step 1 all have $l = 1$
and all members carry an initial score of 1. When two communities $A$
and $B$ merge into one, e.g. $C = A \cup B$, the quantities $S_{i,C}$
and $l_C$ are given by

$$S_{i,C} = S_{i,A} + S_{i,B}, \qquad l_C = l_A + l_B. \tag{1}$$

Physically, $S_{i,C}$ traces the number of occurrences of Vertex $i$
in the $l_C$ partial communities that have merged to form Community
$C$. Vertices with a high value of $S/l$ are regarded as core members,
and those with a small $S$, say less than 3, are very likely vertices
that are misclassified.
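The bookkeeping of scores and partial-community counts amounts to adding occurrence counters. A minimal sketch, assuming a hypothetical `Community` class that is not part of the paper:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Community:
    scores: Counter = field(default_factory=Counter)  # S_i: occurrences of vertex i
    l: int = 0                                        # number of merged partial communities

    @classmethod
    def from_partial(cls, members):
        """A partial community from Step 1: every member scores 1 and l = 1."""
        return cls(Counter(set(members)), 1)

    def merge(self, other):
        """Scores and counts simply add when two communities merge."""
        return Community(self.scores + other.scores, self.l + other.l)

    @property
    def w(self):
        """Total score, i.e. the sum of S_i over all members."""
        return sum(self.scores.values())


a = Community.from_partial({1, 2, 3})
b = Community.from_partial({2, 3, 4})
c = a.merge(b)
print(c.scores, c.l, c.w)
# Vertices 2 and 3 appear in both partial communities, so S/l = 1 for them,
# while vertices 1 and 4 have S/l = 0.5.
```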

Consider two communities $A$ and $B$. We define an asymmetric measure
$f(A, B)$ to take into account the different importance of members as:

$$f(A,B) = \sum_i \frac{S_{i,B}}{l_B} \, \frac{S_{i,A}}{w_A}, \tag{2}$$

where $w_A = \sum_i S_{i,A}$. The term $S_{i,B}/l_B$ represents a
normalized importance of Vertex $i$ in $B$ and $S_{i,A}/w_A$ is a
weighting factor of Vertex $i$ in $A$. The product
$S_{i,B} \cdot S_{i,A}$ ensures that $f$ will not be affected much by
the misclassified vertices, i.e. those with small values of $S$. A
large value of $f(A,B)$ indicates that the core members of $A$ are
also core members of $B$, but not vice versa as $f(A, B) \neq f(B, A)$
in general. This measure has the following properties:

$$f(A, B \cup C) = \frac{l_B}{l_B + l_C} f(A,B) + \frac{l_C}{l_B + l_C} f(A,C), \tag{3}$$

$$f(B \cup C, A) = \frac{w_B}{w_B + w_C} f(B,A) + \frac{w_C}{w_B + w_C} f(C,A), \tag{4}$$

which can be readily shown. Let $\{A\}$ ($\{B\}$) denote the set of
partial communities that form the Community $A$ ($B$). It follows from
Eqs. (3) and (4) that

$$f(A,B) = \sum_{x \in \{A\}} \sum_{y \in \{B\}} \frac{w_x}{w_A l_B} f(x,y). \tag{5}$$

Recall that $x$ and $y$ are partial communities and thus $f(x, y)$ is
the portion of members of $x$ who are also members of $y$, i.e.

$$f(x,y) = \frac{|x \cap y|}{|x|}. \tag{6}$$

Equation (5) indicates that $f(A, B)$ is actually a weighted average
of the overlap portion $f(x,y)$ over all combinations of partial
communities forming $A$ and $B$, i.e. over all pairs $(x, y)$ with
$x \in \{A\}$ and $y \in \{B\}$.
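The weighted-average interpretation can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions: a community is stored as a pair (score counter, number of merged partial communities), the helper names `build` and `f` are hypothetical, and exact fractions are used so that the two sides can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction


def build(partials):
    """Merge partial communities (sets of vertices) into (scores S, count l)."""
    scores = Counter()
    for members in partials:
        scores.update(set(members))
    return scores, len(partials)


def f(a, b):
    """Asymmetric measure: sum over i of (S_iB / l_B) * (S_iA / w_A)."""
    s_a, _ = a
    s_b, l_b = b
    w_a = sum(s_a.values())
    shared = sum(s * s_b[v] for v, s in s_a.items() if v in s_b)
    return Fraction(shared, l_b * w_a)


parts_a = [{1, 2, 3}, {2, 3, 4}]
parts_b = [{3, 4, 5}, {1, 3}]
A, B = build(parts_a), build(parts_b)

# Weighted average over pairs of partial communities: weight w_x / (w_A l_B),
# with w_x = |x| and overlap portion |x & y| / |x| for partial communities.
w_a, l_b = sum(A[0].values()), B[1]
rhs = sum(Fraction(len(x), w_a * l_b) * Fraction(len(x & y), len(x))
          for x in parts_a for y in parts_b)

print(f(A, B), rhs)   # both sides agree for this example
```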

The merger of two communities $A$ and $B$ is different from either $A$
absorbing $B$ or $B$ absorbing $A$. Thus, a symmetric analog of
$f(A, B)$ is preferred for deciding a merger. To motivate the
construction of such a parameter, we introduce a measure $g(C)$ of a
Community $C$ in a way similar to Eq. (5) that compares members in the
partial communities forming $C$:

$$g(C) = \begin{cases} 1, & l_C = 1 \\[4pt] \displaystyle \sum_{\substack{x, y \in \{C\} \\ x \neq y}} \frac{w_x f(x,y)}{w_C (l_C - 1)}, & l_C > 1 \end{cases} \tag{7}$$

It gives the average portion of overlap between partial communities in
a merged community. Its physical meaning can be seen by considering
the special case that $C$ is an ER random network $G(n,p)$ with
members randomly connected with a probability $p$. A partial community
now consists of a vertex and all its neighbors. The expected portion
of overlap between two partial communities $f(x, y)$ is roughly $p$,
giving $g(C) \approx p$. This indicates that $g(C)$ is approximately a
measure of the fraction of other members that a member is connected
to. A larger $g(C)$ implies denser internal edges in the community and
thus members are connected tightly to each other. It can be used as an
indicator of whether a merged community is a real community or just a
wrongly merged set of vertices.
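The ER special case is easy to check numerically. The sketch below is an illustration, not code from the paper: it takes each partial community to be a vertex together with its neighbors, as the text describes, and evaluates the measure $g$ directly; the function names, the network size of 200, and the random seed are arbitrary assumptions.

```python
import random
from itertools import combinations


def er_ego_sets(n, p, rng):
    """Closed neighborhoods (a vertex plus its neighbors) of an ER graph G(n, p)."""
    ego = [{v} for v in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            ego[u].add(v)
            ego[v].add(u)
    return ego


def g_of_partials(partials):
    """Average overlap portion between the partial communities forming a community."""
    l_c = len(partials)
    if l_c == 1:
        return 1.0
    w_c = sum(len(x) for x in partials)
    # For partial communities w_x * f(x, y) reduces to |x & y|.
    total = sum(len(x & y)
                for i, x in enumerate(partials)
                for j, y in enumerate(partials) if i != j)
    return total / (w_c * (l_c - 1))


rng = random.Random(0)
g = g_of_partials(er_ego_sets(200, 0.3, rng))
print(round(g, 3))   # expected to lie close to p = 0.3
```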

For the case $C = A \cup B$, $g(C)$ satisfies

$$g(C) = \frac{1}{w_C (l_C - 1)} \Big\{ w_A (l_A - 1)\, g(A) + w_B (l_B - 1)\, g(B) + (w_A l_B + w_B l_A)\, f_s(A, B) \Big\}, \tag{8}$$

where

$$f_s(A, B) = \frac{w_A l_B f(A, B) + w_B l_A f(B, A)}{w_A l_B + w_B l_A} = \frac{2 \sum_i S_{i,A} \cdot S_{i,B}}{w_A l_B + w_B l_A}. \tag{9}$$

There are three terms in Eq. (8) for $g(C)$. The first and second
terms give the overlap portions within $A$ and within $B$,
respectively. The third term in Eq. (8) measures the overlap between
$A$ and $B$. It is important to note that a symmetric measure
$f_s(A, B)$, as defined in Eq. (9), emerges. It is a weighted average
of the asymmetric measures $f(A, B)$ and $f(B, A)$ and yet itself
satisfies $f_s(A, B) = f_s(B, A)$. It follows from Eq. (9) that
$f_s(A, B \cup C)$ is given by a weighted average of $f_s(A, B)$ and
$f_s(A, C)$, and thus

$$f_s(A, B \cup C) \le \max\{ f_s(A, B), f_s(A, C) \}. \tag{10}$$

We are thus led to apply $f_s$ as a symmetric similarity measure
between two communities that accounts for the different importance of
the members.
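A small numerical example may help in reading the symmetric measure and its bound. This is a sketch under the same assumed representation as before, a community stored as a pair (score counter, number of merged partial communities); the helper names `build` and `f_s` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def build(partials):
    """Merge partial communities (sets of vertices) into (scores S, count l)."""
    scores = Counter()
    for members in partials:
        scores.update(set(members))
    return scores, len(partials)


def f_s(a, b):
    """Symmetric similarity: 2 * sum_i S_iA * S_iB / (w_A l_B + w_B l_A)."""
    (s_a, l_a), (s_b, l_b) = a, b
    w_a, w_b = sum(s_a.values()), sum(s_b.values())
    shared = sum(s * s_b[v] for v, s in s_a.items() if v in s_b)
    return Fraction(2 * shared, w_a * l_b + w_b * l_a)


A = build([{1, 2, 3}, {2, 3, 4}])
B = build([{3, 4, 5}])
C = build([{1, 3}])
BC = (B[0] + C[0], B[1] + C[1])      # B merged with C: scores and counts add

print(f_s(A, B), f_s(A, C), f_s(A, BC))
# The similarity to the merged community lies between the two separate values,
# so merging B with C cannot make it a better partner for A than the better
# of B and C already was.
```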

Based on the idea of agglomerative hierarchical clustering, the merger
process using $f_s$ as the similarity measure can be implemented as
follows. Given a set $\mathcal{A}$ of communities to be merged, a
straightforward way is to:

1. Calculate $f_s$ for each pair of communities in $\mathcal{A}$ and
maintain a priority queue of $f_s$ in descending order.

2. Merge the pair with the largest $f_s$ and update the priority
queue.

3. Repeat 2 until the largest $f_s$ in the priority queue falls below
a threshold $t_{f_s}$.

The time complexity of this algorithm is $O(n^2 \log n)$, where $n$ is
the number of communities in $\mathcal{A}$. The space complexity is
$O(n^2)$ as we need to maintain the priority queue of $f_s$. For
detecting communities in large-scale networks, a more efficient
algorithm is desirable. In what follows, we propose two optimizations
to reduce both the time and space complexity to $O(n)$.
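The straightforward procedure listed above can be sketched with a binary heap. This is an illustrative implementation, not the authors' code: the function names, the (scores, count) representation, and the lazy discarding of stale heap entries are assumptions of the sketch.

```python
import heapq
from collections import Counter
from itertools import combinations


def f_s(a, b):
    """Symmetric similarity for communities given as (scores, l)."""
    (s_a, l_a), (s_b, l_b) = a, b
    w_a, w_b = sum(s_a.values()), sum(s_b.values())
    shared = sum(s * s_b[v] for v, s in s_a.items() if v in s_b)
    return 2 * shared / (w_a * l_b + w_b * l_a)


def naive_merge(partials, threshold):
    comms = {i: (Counter(set(p)), 1) for i, p in enumerate(partials)}
    next_id = len(comms)
    # Max-heap emulated with negated similarities; holds every pair, hence O(n^2) space.
    heap = [(-f_s(comms[i], comms[j]), i, j) for i, j in combinations(comms, 2)]
    heapq.heapify(heap)
    while heap:
        neg_sim, i, j = heapq.heappop(heap)
        if -neg_sim < threshold:
            break                      # largest remaining similarity is below threshold
        if i not in comms or j not in comms:
            continue                   # stale entry: one side has already been merged
        (s_i, l_i), (s_j, l_j) = comms.pop(i), comms.pop(j)
        merged = (s_i + s_j, l_i + l_j)
        for k, other in comms.items():
            heapq.heappush(heap, (-f_s(merged, other), k, next_id))
        comms[next_id] = merged
        next_id += 1
    return list(comms.values())


partials = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 4, 5},
            {10, 11, 12}, {11, 12, 13}, {1, 10}]
for scores, l in naive_merge(partials, threshold=0.5):
    print(l, dict(scores))
```

In this example the first three partial communities merge into one community, the next two into another, and the weakly attached pair {1, 10} stays on its own.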

We define the best merger candidate of a community $A$ as

$$\mathrm{bmc}(A) = \underset{X \in \mathcal{A},\, X \neq A}{\arg\max}\; f_s(A, X). \tag{11}$$

We argue that the algorithm above is equivalent to:

    given a set $\mathcal{A}$ of communities
    repeat
        choose a community $A$ from $\mathcal{A}$
        $B \leftarrow \mathrm{bmc}(A)$
        while $\mathrm{bmc}(B) \neq A$ do
            $A \leftarrow B$
            $B \leftarrow \mathrm{bmc}(A)$
        end while
        if $f_s(A, B) > t_{f_s}$ then
            merge $A$ and $B$
            remove $A$ and $B$, add $A \cup B$ to $\mathcal{A}$
        end if
    until no communities can be merged anymore
    return $\mathcal{A}$
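The chain-following loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `bmc` scans all communities exhaustively rather than a restricted neighborhood, ties are broken by community id, communities whose best similarity cannot exceed the threshold are retired from an `active` set to guarantee termination, and all names are hypothetical.

```python
from collections import Counter


def f_s(a, b):
    """Symmetric similarity for communities given as (scores, l)."""
    (s_a, l_a), (s_b, l_b) = a, b
    w_a, w_b = sum(s_a.values()), sum(s_b.values())
    shared = sum(s * s_b[v] for v, s in s_a.items() if v in s_b)
    return 2 * shared / (w_a * l_b + w_b * l_a)


def merge_by_mutual_best(partials, threshold):
    comms = {i: (Counter(set(p)), 1) for i, p in enumerate(partials)}
    next_id = len(comms)
    active = set(comms)  # communities that might still take part in a merger

    def bmc(a):
        # Exhaustive scan (an assumption of this sketch). The symmetric pair key
        # breaks ties the same way from both sides, so the chain cannot cycle.
        return max((k for k in comms if k != a),
                   key=lambda k: (f_s(comms[a], comms[k]), min(a, k), max(a, k)))

    while active and len(comms) > 1:
        a = next(iter(active))
        chain = [a]
        b = bmc(a)
        while bmc(b) != a:              # walk until the two pick each other
            a, b = b, bmc(b)
            chain.append(a)
        chain.append(b)
        if f_s(comms[a], comms[b]) > threshold:
            (s_a, l_a), (s_b, l_b) = comms.pop(a), comms.pop(b)
            comms[next_id] = (s_a + s_b, l_a + l_b)   # scores and counts add
            active -= {a, b}
            active.add(next_id)
            next_id += 1
        else:
            # Similarity never decreases along the chain, and merging others cannot
            # raise it, so no visited community can exceed the threshold later.
            active -= set(chain)
    return list(comms.values())


partials = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 4, 5},
            {10, 11, 12}, {11, 12, 13}, {1, 10}]
for scores, l in merge_by_mutual_best(partials, threshold=0.5):
    print(l, dict(scores))
```

No global queue of pairwise similarities is kept; only the communities themselves are stored, which is the source of the reduced space requirement.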

The algorithm makes use of the property of $f_s$ given in Eq. (10). If
$A$ and $B$ are the best merger candidates of each other, there does
not exist a community $C$ that gives $f_s(A, C) > f_s(A, B)$, where
$C$ can be any combination of communities in $\mathcal{A}$. Therefore,
even if $f_s(A, B)$ is not at the top of the $f_s$ priority queue, the
merger of $A$ and $B$ can be moved forward, since other mergers higher
on the priority queue that would take place will not affect the merger
of $A$ and $B$. An advantage is that mergers are not required to
proceed in order in the algorithm, and thus there is no need to
maintain the $f_s$ priority queue. The space complexity is reduced
from $O(n^2)$ to $O(n)$.

The search for $\mathrm{bmc}(A)$ is formally within the set
$\mathcal{A}$. Practically, the search area can be reduced
significantly, as most of the communities in $\mathcal{A}$ do not even
share a single member with $A$ in sparse networks. A good
approximation is to limit the search to the partial communities from
the viewpoints of $A$'s members and the merged communities containing
these partial communities. As such, the time complexity of calculating
$\mathrm{bmc}(A)$ does not scale with $n$, provided that the community
size and the number of partial communities per vertex are independent
of the network size. The number of iterations of finding a pair of
communities to merge should also be independent of $n$. The repeated
loop requires a time complexity of $O(n)$. Since $n$ usually scales
linearly with the network size, it can also be regarded as the network
size. We thus argue that the time complexity of our optimized merger
algorithm is approximately $O(n)$. This is verified numerically in
Sec. III.


C. Post-processing 

After merging communities, a cleaning process is needed to handle two
types of noise. First, we need to identify which merged communities
are real communities and which are simply sets of vertices merged by
coincidence. The latter usually contain only a small number of partial
communities because they are merged by accident. The more partial
communities a merged community contains, the more likely it is a real
community. Thus, the parameter $l$ of a community can be used as a
measure of whether a detected community is trustworthy. A way to sift
out real communities is to set a threshold $t_l$ and require all real
communities to have $l > t_l$. The threshold can be set in many ways,
e.g. setting $t_l$ based on each community's size, as a larger
community usually carries a larger $l$. Second, we need to identify
and eliminate vertices that are misclassified into partial communities
in Step 1. Recall that $S_{i,C}$ is the number of occurrences of
Vertex $i$ in the $l_C$ partial communities that formed the Community
$C$. Roughly, the probability of Vertex $i$ being a false member
should drop with $S_{i,C}$. Thus, a threshold $t_S$ can be set to
eliminate vertices with $S < t_S$. Normally $t_S = 4$ is sufficiently
stringent and it should not be less than 3. There remain vertices with
$S \ge t_S$ but $S/l \approx 0$. They may still not be members since
they know too few other members. The ratio $S/l$ gives an estimate of
the fraction of the other members that a member is connected to.
Another criterion $S/l \ge t_{S/l}$ becomes useful, with $t_{S/l}$
being a threshold that requires each member to be connected to at
least $t_{S/l} \times 100\%$ of the other members. This criterion
echoes the concept of f-core discussed in Sec. I. The threshold
$t_{S/l}$ can either be set uniformly for all communities or
individually for each community based on the value of $g$, which
reflects the average portion of the other members that a member is
connected to. Since different kinds of networks may have different
community structures, the choice of $t_{S/l}$ depends on the nature of
communities of the specific network under study.
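The cleaning step can be sketched as a pair of filters. This is a minimal illustration with hypothetical names; a merged community is assumed to be stored as a pair (score counter, number of merged partial communities), uniform thresholds are used, and the choice of strict versus non-strict comparisons at the thresholds is an assumption of the sketch.

```python
from collections import Counter


def post_process(communities, t_l, t_S, t_Sl):
    """Keep trustworthy communities and, within them, sufficiently attached members.

    communities: iterable of (scores, l) pairs
    t_l:  minimum number of merged partial communities (strict comparison assumed)
    t_S:  minimum number of occurrences of a member
    t_Sl: minimum fraction S / l of a member
    """
    cleaned = []
    for scores, l in communities:
        if not l > t_l:
            continue                       # probably a coincidental merger
        members = {v for v, s in scores.items()
                   if s >= t_S and s / l >= t_Sl}
        if members:
            cleaned.append(members)
    return cleaned


merged = [
    (Counter({"a": 12, "b": 6, "c": 3, "d": 1}), 12),  # built from 12 partial communities
    (Counter({"x": 2, "y": 1}), 2),                    # built from only 2
]
print(post_process(merged, t_l=10, t_S=3, t_Sl=0.3))
# "c" passes the occurrence threshold but is linked to only a quarter of the
# community, and "d" appears just once, so both are removed.
```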

D. Applicability 

PCMA works under two conditions: existence of partial communities
(Step 1) and adequate overlap between partial communities for mergers
(Step 2). Usually, the second condition is satisfied automatically
when the network under study meets the first condition. For the first
condition, the existence of partial communities from the viewpoint of
a vertex requires that there are a sufficient number of neighbors and
a high density of edges among the neighbors, i.e., a high local
clustering coefficient of the vertex.
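The local clustering coefficient mentioned here is the fraction of pairs of neighbors that are themselves connected. A minimal sketch, assuming an undirected network stored as a dictionary of neighbor sets (the function name is hypothetical):

```python
from typing import Dict, Hashable, Set


def local_clustering(adj: Dict[Hashable, Set[Hashable]], v: Hashable) -> float:
    """Fraction of pairs of neighbors of v that are linked to each other."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Each edge among the neighbors is seen from both of its endpoints.
    links = sum(len(adj[u] & nbrs) for u in nbrs) // 2
    return 2 * links / (k * (k - 1))


# A triangle 0-1-2 with an extra pendant neighbor 3 attached to vertex 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(adj, 0), local_clustering(adj, 1), local_clustering(adj, 3))
```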

We expect a community detected by PCMA to have the following
properties:

1. Two members with common neighbors are highly likely to be neighbors
of each other. As a consequence, the community has a relatively high
value of clustering coefficient.

2. The shortest distances between most pairs of members are generally
short and not longer than 3. Thus, most members are connected to each
other either directly or via one or two intermediate members.

3. Each member is connected to at least a certain fraction of the
other members.

From another perspective, these properties can be taken as a broad
descriptive definition of community, and they are well suited for
describing communities with significant overlaps. PCMA is designed to
detect this kind of community, which is important in large-scale
systems with vertices typically having multiple memberships.

III. BENCHMARKING 

We tested PCMA using two benchmark models to illustrate its
performance and applicability. Results are compared with two fast and
accurate overlapping community detection algorithms that are among the
best, namely OSLOM [11] and SLPA [13].

First, a simple benchmark model in the spirit of the planted
l-partition model is used. The network is generated as follows:

1. Generate an ER random network of $n$ vertices with a mean degree
$\langle k \rangle$ that serves as background noise.

2. Randomly sample $s$ vertices as a community, with $s$ drawn from a
Poisson distribution with an expected value of $\langle s \rangle$.
Connect each pair of members with a probability $p$.

3. Repeat Step 2 to generate $n \langle c \rangle / \langle s \rangle$
communities, so that $\langle c \rangle$ is the expected number of
communities that a vertex belongs to.
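The three generation steps can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the generator used for the reported results: the background is drawn as a fixed number of random edges instead of independent edge trials, the number of planted communities is taken as $n \langle c \rangle / \langle s \rangle$ so that the mean membership equals $\langle c \rangle$, Poisson sizes come from Knuth's multiplication method, and all function and parameter names are hypothetical.

```python
import math
import random
from itertools import combinations


def poisson(mean, rng):
    """Knuth's multiplication method; adequate for moderate means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def benchmark(n, k_mean, s_mean, c_mean, p, seed=0):
    """Return (adjacency list of sets, list of planted communities)."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]

    # Step 1: sparse random background with mean degree k_mean (assumes k_mean << n).
    edges, target = 0, round(n * k_mean / 2)
    while edges < target:
        u, v = rng.sample(range(n), 2)
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            edges += 1

    # Steps 2 and 3: plant communities of Poisson size with internal density p.
    communities = []
    for _ in range(round(n * c_mean / s_mean)):
        size = min(poisson(s_mean, rng), n)
        members = rng.sample(range(n), size)
        for u, v in combinations(members, 2):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
        communities.append(set(members))
    return adj, communities


adj, communities = benchmark(n=2000, k_mean=3, s_mean=40, c_mean=2, p=0.3, seed=1)
memberships = sum(len(c) for c in communities) / len(adj)
homeless = len(adj) - len(set().union(*communities))
print(len(communities), round(memberships, 2), homeless)
```

Because members are sampled independently for each community, some vertices end up in several communities and some in none, which is the feature the benchmark is meant to provide.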


This model is flexible in that the total number, size, and
intra-community edge density of the communities, as well as the
background noise level, can be directly controlled. In addition,
vertices can belong to several communities or even be homeless. There
is no guarantee that there are more edges within a community than
edges going out. These features make many existing community detection
algorithms invalid. They reflect the challenges that real social
networks, in which a person often belongs to many groups
simultaneously, pose to community detection. PCMA is designed to solve
the problem.

Consider a network with n = 10®, $\langle k \rangle = 3$, $p = 0.3$,
$\langle s \rangle = 40$, and $\langle c \rangle = 2$ generated as
described. The threshold $t_{f_s} = 0.1$ is chosen for the merger
process. Figure 2 shows the actual community size distribution
generated by the model (diamonds). The results as detected by PCMA
before (circles) and after post-processing (squares) are shown for
comparison. As discussed in Sec. II C, the merged communities without
post-processing are noisy. The results show a peak at a small
community size due to many coincidentally merged sets of vertices,
which are targeted for removal in post-processing. The results also
show a bump in the distribution at large community sizes. Although
these are real communities, the many misclassified vertices make their
sizes bigger than their actual sizes. Hence, results before
post-processing could be misleading. By setting the thresholds
properly, as given in the caption of Fig. 2, the distribution after
post-processing is in good agreement with the actual community size
distribution.
FIG. 2: (Color online) Community size distribution as generated by the
benchmark model (diamonds) described in the text. The corresponding
results as detected by our method before (circles) and after (squares)
applying the post-processing scheme are shown for comparison. The
thresholds in the method are set to be: $t_{f_s} = 0.1$, $t_l = 10$,
$t_S = 3$, and $t_{S/l} = 0.1$.

To quantify the accuracy of PCMA, we adopt the widely used Normalized
Mutual Information (NMI) as extended by Lancichinetti et al. to
compare overlapping communities. Figure 3(a) compares the performance
of OSLOM and PCMA on synthetic networks with different intra-community
edge densities. Results of SLPA are not shown because the method
cannot detect homeless vertices.
FIG. 3: (Color online) Performance comparisons of PCMA and widely used existing methods in a simple benchmark model and the LFR
benchmark model. Unless stated otherwise, parameters of our simple benchmark model in (a) and (b) are: n = 10“*, $\langle k \rangle = 20$, $p = 0.3$,
$\langle s \rangle = 40$, $\langle c \rangle = 3$. Parameters of the LFR benchmark model in (c) are: n = 10^, $\langle k \rangle = 40$, $k_{\max} = 100$, $\mu = 0.3$, $\tau_1 = 2$, $\tau_2 = 1$,
$c_{\min} = 20$, $c_{\max} = 100$, and each overlapping vertex has two communities. The thresholds for PCMA are the same as those given in Fig. 2. The
number of iterations of OSLOM is set to $r = 10$. For SLPA, the program applies different thresholds ranging from 0.01 to 0.5 by default and
we select the best result. Each data point is an average of 10 realizations. If not shown, the error bar is smaller than the size of the symbol.


The accuracy of OSLOM depends strongly on the number of iterations. We
use the default value $r = 10$ suggested in Ref. [11], unless
specified otherwise. PCMA is sensitive to the intra-community edge
density $p$ because it affects the existence of partial communities,
which is a criterion for the applicability of PCMA (see Sec. II D). In
the benchmark model, the probability $p$ promotes partial communities.
PCMA works well when

$$\langle k_{nn} \rangle = [(\langle s \rangle - 1)\, p - 1]\, p > 2, \tag{12}$$

where $(\langle s \rangle - 1) p$ is the expected number of neighbors
of a member and $\langle k_{nn} \rangle$ is the expected number of
edges of a neighbor connecting to other neighbors of the member. The
neighbors start to be strongly connected, i.e. partial communities
emerge, when $\langle k_{nn} \rangle > 2$. From Fig. 3(a), PCMA
performs better than OSLOM for $p > 0.28$, corresponding to
$\langle k_{nn} \rangle > 2.78$ for the case of
$\langle s \rangle = 40$.
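The arithmetic behind these numbers is easy to reproduce. In the sketch below the function names are hypothetical, and `min_density` simply solves the quadratic obtained by setting the applicability condition to equality.

```python
import math


def k_nn(s_mean: float, p: float) -> float:
    """Expected number of links from a neighbor to the other neighbors of a member."""
    return ((s_mean - 1) * p - 1) * p


def min_density(s_mean: float, target: float = 2.0) -> float:
    """Positive root of (s - 1) p^2 - p - target = 0, i.e. where k_nn reaches target."""
    a = s_mean - 1
    return (1 + math.sqrt(1 + 4 * a * target)) / (2 * a)


print(round(k_nn(40, 0.28), 4))     # (39 * 0.28 - 1) * 0.28
print(round(min_density(40), 3))    # density at which k_nn = 2 for mean size 40
```

For a mean community size of 40 the condition is first met at a density of roughly 0.24, somewhat below the value 0.28 at which PCMA is reported to overtake OSLOM.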

Figure 3(b) shows the dependence of the performance on
$\langle c \rangle$, which controls the expected number of communities
that a vertex belongs to. A larger $\langle c \rangle$ corresponds to
more edges connecting the communities and thus a denser and more
complex network. For a system with n = 10^ members, the accuracy of
OSLOM falls rapidly with increasing $\langle c \rangle$. For PCMA, the
accuracy remains high throughout, with a slight drop due to the finite
size of the network instead of $\langle c \rangle$. This is verified
by the performance of PCMA in a bigger system of n = 10^ (circles in
Fig. 3(b)). Recall that many existing algorithms become invalid in
problems in which a vertex may belong to many communities, but PCMA
handles them well.

We also tested PCMA with the widely used LFR benchmark model.
Figure 3(c) compares the performance of PCMA with OSLOM and SLPA. In
the LFR model, a vertex has a degree chosen from a distribution that
follows a power law of exponent $\tau_1$ in a range of degrees
$k_{\min} \le k \le k_{\max}$ corresponding to a mean degree
$\langle k \rangle$. A tunable fraction of vertices are chosen to
belong to more than one community. They are the overlapping vertices.
The remaining vertices have only one community. For a vertex of degree
$k$, a parameter $\mu$ sets the fraction of the edges to be connected
to vertices outside the community(ies) that the vertex belongs to. The
remaining fraction $(1 - \mu)$ of edges are evenly divided among the
communities, if the vertex is chosen to have multiple communities. The
community sizes also follow a power law with an exponent $\tau_2$
within a range of community sizes between $c_{\min}$ and $c_{\max}$.
The combinations of parameters in the LFR model give a class of
tunable structures for the resulting networks. From Fig. 3(c), all
three methods work very well when there are very few overlapping
vertices. When communities overlap more, PCMA performs better than the
other two methods over a wide range of the fraction of overlapping
vertices, except for the last data point in Fig. 3(c) in comparison
with OSLOM. In the LFR model, the degree assignment does not
distinguish single-community vertices from multi-community vertices.
As a result, a vertex belonging to multiple communities has fewer
edges connecting to each of its communities. In PCMA, however, members
are expected to be connected to at least a certain fraction of the
other members before establishing their membership. This leads to the
gradual drop in PCMA's NMI with increasing number of overlapping
vertices, which are members according to the benchmark model but may
not be identified by PCMA. We remark that it is actually not a problem
of accuracy, but more about what a community should be.

We also studied the time complexity of the three methods numerically
based on the LFR benchmark model. Calculations were performed on a
workstation with an Intel Xeon E5-2609 @ 2.4GHz (4 cores / 8 threads).
The programs were allowed to use all threads if they were
parallelized. Figure 4 shows how the execution time scales with the
network size. SLPA and PCMA are almost equally fast and OSLOM is
slower by at least 500 times. In the log-log plot in Fig. 4, the
slopes for SLPA, OSLOM and PCMA are 1.10, 1.09 and 1.00, respectively.
It is, therefore, numerically verified that the time complexity of
PCMA is $O(n)$.
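A slope on a log-log plot is an ordinary least-squares fit of log time against log size. The sketch below shows the calculation; the function name is hypothetical and the timing values are synthetic placeholders generated from exact power laws, not measurements from this study.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log10(time) versus log10(size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


sizes = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]
linear = [2e-3 * n for n in sizes]            # synthetic: time proportional to n
superlinear = [2e-3 * n ** 1.1 for n in sizes]  # synthetic: time proportional to n^1.1
print(round(loglog_slope(sizes, linear), 2), round(loglog_slope(sizes, superlinear), 2))
```

A fitted slope of 1 corresponds to execution time growing in proportion to the network size.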

In summary, the benchmark tests showed that PCMA is an efficient
algorithm specifically suitable for detecting communities in networks
in which the vertices may belong to multiple communities.
FIG. 4: (Color online) Comparison of time complexity. Tests are
conducted on the LFR benchmark model. The fraction of overlapping
vertices is set to 50%. Other parameters are the same as those in
Fig. 3. Each data point is an average over 10 realizations. Error bars
are smaller than the size of the symbols.


IV. EMPIRICAL ANALYSIS 


Having established the efficiency and accuracy of PCMA,
we apply it to data from Sina Weibo, a huge social network in
China that is akin to a hybrid of Twitter and Facebook. It is
a directed network like Twitter, which allows fast information
spreading, and it also has the characteristics of Facebook for
interactions with friends. We focused on the embedded undirected
friendship network consisting only of reciprocal edges and applied
PCMA to extract its community structure. The network we sampled
from the Internet contains about 80 million vertices and 1.0
billion reciprocal edges, with only 1.2% of the edges being
connected to vertices that are not sampled. The sampled
network can thus be roughly regarded as the core of the whole
network. Many vertices have only a few neighbors, and we do
not expect to find partial communities by searching over these
vertices. Instead, we only need to search the ego networks of
vertices with a degree equal to or higher than a specific value,
which is taken to be 20 here. Some abnormal vertices with
extremely high clustering coefficients are also omitted. On an
ordinary workstation, PCMA took only about 35 hours to complete
the detection in such a huge network, finding millions of
communities.
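
The preprocessing described above (keep only reciprocal edges, then search only the ego networks of vertices with degree of at least 20) can be sketched as follows. This is an illustrative sketch rather than the released implementation; the function names and the toy edge list are assumptions, and the omission of vertices with extremely high clustering coefficients is not shown.

```python
from collections import defaultdict

def reciprocal_graph(directed_edges):
    """Keep an undirected edge {u, v} only if both u->v and v->u exist."""
    edge_set = set(directed_edges)
    adj = defaultdict(set)
    for u, v in edge_set:
        if u != v and (v, u) in edge_set:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def ego_candidates(adj, min_degree=20):
    """Vertices whose ego networks are searched for partial communities."""
    return sorted(v for v, nbrs in adj.items() if len(nbrs) >= min_degree)

# Toy directed "follow" list: a<->b and a<->c are reciprocal; b->c and d->a are not.
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c"), ("d", "a")]
adj = reciprocal_graph(follows)
print(ego_candidates(adj, min_degree=2))
```

With the toy data, only vertex a has two reciprocal neighbors, so it is the only candidate when the degree cutoff is lowered to 2 for illustration.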

To illustrate the properties of the communities, Fig. 5 shows
two histograms of the communities as a function of their
value of g and community size, for two different thresholds
t_l = 10 and 2. Each vertical cut gives the distribution of g for
communities of the same size. The values in each cut are
rescaled by mapping the highest value to unity. Usually a larger
community would have a lower g, since the number of friends
a person could have is limited and does not grow linearly with
the community size. The results in Fig. 5(a) illustrate this
relationship and show that a member typically knows 20-30%
of the other members in communities that are not too big. The
results also show that PCMA is a successful algorithm in that
most of the detected communities have relatively large
values of g, indicating that they are real communities. Choosing
a low threshold of t_l = 2 gives an abnormal plunge in g at
small community sizes, as shown in Fig. 5(b). The low
threshold leads to many false communities that are merged only a
few times, as discussed in Sec. III C. Raising t_l from 2 to 10
reduces the number of detected communities from 4.6 million
to 0.9 million and removes most false communities. This
success is accompanied by the drawback that real communities
with fewer than 10 vertices are also removed. We also modified
Eq.® slightly to deal with the empirical data better. Although 
the threshold of 0.1 is sufficiently harsh for large partial
communities, it can easily be satisfied by small ones. For
example, two partial communities of size 10 only need a single
common member to meet the criterion. Therefore, we
suppress such unwanted mergers of small partial communities by
forcing f_s(A, B) = 0 if Σ_i s_{i,A} · s_{i,B} / max{l_A, l_B}
falls below a lower bound, which we set to 4 in our analysis.
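
The guard against merging small partial communities can be illustrated with a short sketch. It assumes a particular reading of the criterion above: each community carries a per-vertex score s_i and a merge count l, and a freshly found partial community has l = 1 with all member scores equal to 1. The function names and this reading of the symbols are assumptions made for illustration, not the released implementation.

```python
def overlap_score(scores_a, scores_b, l_a, l_b):
    """Sum over shared vertices of s_iA * s_iB, divided by max(l_A, l_B)."""
    shared = scores_a.keys() & scores_b.keys()
    return sum(scores_a[i] * scores_b[i] for i in shared) / max(l_a, l_b)

def merger_suppressed(scores_a, scores_b, l_a, l_b, lower_bound=4):
    """True if the similarity would be forced to zero (no merger)."""
    return overlap_score(scores_a, scores_b, l_a, l_b) < lower_bound

# Three partial communities of size 10, each with l = 1 and unit scores.
a = {v: 1 for v in range(0, 10)}    # vertices 0..9
b = {v: 1 for v in range(9, 19)}    # shares only vertex 9 with a
c = {v: 1 for v in range(5, 15)}    # shares vertices 5..9 with a

print(merger_suppressed(a, b, 1, 1))   # a single common member
print(merger_suppressed(a, c, 1, 1))   # five common members
```

Under these assumptions, the pair sharing a single member is blocked from merging, whereas the pair sharing five members remains eligible and is then judged by the usual merger criterion.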





FIG. 5: (Color online) PCMA was applied to detect communities
in a huge data set collected from a social network. The resulting
communities have different sizes and g (see Eq. (7)). Histograms of
g and community size among the detected communities are shown
for two different thresholds t_l. Unless otherwise stated, parameters
are the same as those given in Fig. 2.

The detected communities also confirm the multiple
memberships of a vertex and thus the significant overlap in
communities. Figure 6 gives a complete community that is only
partially revealed in the ego community in Fig. 1. The
community is found by merging l = 49 partial communities.
After the mergers, it has 287 members, most of which are
misclassified. Setting the thresholds in post-processing
removes misclassified members, leaving the detected
community with 58 members. It has a value of g = 0.50, implying
that on average every member knows about half of the other
members. The community actually shows a core-periphery
structure, i.e., there is a group of key members knowing most
members and many peripheral members knowing only the key
members. The number on a vertex in Fig. 6 gives the number
of communities that the vertex belongs to. Most vertices have
multiple memberships and some of them even belong to more
than 10 communities. The number of edges going out of the
community is 5676, much larger than the number of
reciprocal edges 733 × 2 within the community (counted twice).
This also confirms that one should not assume a community 
to have more internal edges than external edges when there 
are significant overlaps between communities. 
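
The comparison between internal and outgoing edges can be reproduced for any detected community with a simple count. The sketch below assumes an undirected adjacency structure and a hypothetical helper name; the toy graph is illustrative and is not the Weibo data.

```python
def edge_counts(adj, community):
    """Return (internal edges, edges leaving the community)."""
    members = set(community)
    internal_twice = 0   # each internal edge is seen from both endpoints
    external = 0
    for v in members:
        for u in adj[v]:
            if u in members:
                internal_twice += 1
            else:
                external += 1
    return internal_twice // 2, external

# Toy undirected graph: a triangle {1, 2, 3} with four edges to outsiders.
adj = {
    1: {2, 3, 4}, 2: {1, 3, 5}, 3: {1, 2, 6, 7},
    4: {1}, 5: {2}, 6: {3}, 7: {3},
}
internal, external = edge_counts(adj, {1, 2, 3})
print(internal, external)
```

Even in this toy case the three members have more edges to the outside (4) than among themselves (3), which is the situation that arises when members belong to several communities.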



FIG. 6: A real community consisting of 58 members as revealed by
applying PCMA to a real social network data set. The label on a vertex
gives the number of different communities that the vertex belongs
to. The number of reciprocal edges within the community (counted
twice) is 733 × 2, and the number of edges (not shown) connecting
to the outside is 5676.


V. SUMMARY 

We proposed and implemented a Partial Community
Merger Algorithm specifically designed for detecting
communities in big data sets in which a member may have multiple
memberships. The structure of these communities is
characterized by a strong overlap in members, and thus a community
may have many more edges connecting to the outside than
within the community. Such structures render many
existing community searching algorithms invalid, yet they
often show up in real-world systems. Through PCMA, we
provided a conceptual framework as well as a practical algorithm
for dealing with these systems, supplementing the toolbox
of community detection in complex network science. Details
of implementing PCMA were discussed. We used two
benchmark models and compared the results with those of two
widely used algorithms to establish the validity and accuracy of
PCMA. The method does not need prior knowledge of the number of
communities to search for, and it is capable of analyzing
communities in large-scale networks in linear time. The algorithm
was applied to a huge social network data set. In addition to
identifying the communities, PCMA also reveals who the key
members of a community are and how many different
communities a member belongs to. The high accuracy and linear
time complexity make PCMA a promising tool for
detecting communities with significant overlaps in huge social
networks, which cannot be handled by most existing algorithms.

We end with a few remarks. Although we described and
implemented PCMA only for unweighted networks, the
approach is flexible and can be readily extended to treat
weighted networks. Another extension is to properly tune the
threshold t for exploring the hierarchical structure of
communities. Like any other algorithm, PCMA also has its
limitations. It is not designed to detect small communities and
it will not work in networks that are too sparse. There is also
the problem, common among algorithms, of distinguishing real
communities from false ones. This is actually a deeper
question, because whether there really exists a clear boundary for
distinguishing “real communities” from “false ones” is
questionable. A more practical approach would be to explore
methods of choosing the thresholds properly or to construct a
function involving g, l, and community size to make the
detected communities more trustworthy. Finally, the source code of
our implementation of PCMA is released as free software
under the GNU General Public License version 2 or any later
version (GPLv2+).

Appendix A: Searching for Partial Communities

Before carrying out PCMA, we need to search for partial 
communities in the ego network of each vertex. We adopt an 
efficient community detection algorithm proposed by Ball et 
al. s for the purpose. Starting from an ego network of size 
n, the algorithm takes the number K of communities to be 
found as an input. The output is an n × K matrix, in which the
K numbers in the row for a vertex are the belonging
coefficients of that vertex for the K communities. For
example, suppose we set out to find 5 communities in an ego network,
labelled C1 to C5. Then each row has 5 numbers, e.g.
0.64, 0.29, 0, 0.07, 0 for the j-th row, denoting that the vertex
j has a portion of 64% belonging to C1, 29% to C2, and 7%
to C4. To convert the fuzzy assignment of members to a
definite assignment, we impose a threshold such that a community
carries a vertex j as its member only if the belonging
coefficient of the vertex j for the community is above 0.20. In the
example, only communities C1 and C2 have the vertex j as
a member. For the central vertex of the ego network, as the
communities are its partial communities, it is treated as a
default member of all the communities regardless of the
threshold. Partial communities can then be derived from every ego
network. We remark that although the number of partial
communities in an ego network is not known prior to the search,
it can easily be estimated for small networks. An appropriate
overestimation is necessary, as the homeless vertices in an
ego network also need to be accommodated by some
communities. Such overestimation is harmless, as the false partial
communities can be handled properly in the merger and
post-processing steps that follow. It is reasonable to assume that the
number of partial communities is proportional to the ego
network size. In the present work, the number of communities to
be found is (over)estimated as 1/30 of the network size, subject to a
lower bound of 5 and an upper bound of 20. The overestimation
will not cause a problem, as the true communities can be
merged when they have a significant overlap. For the example
in Fig. 1, we set out to find 10 communities and obtained 3
apparent partial communities (coloured purple, blue and green)
and 2 possible ones (red and yellow). The green one actually
consists of 3 highly overlapping communities which should
be regarded as one. We merge a pair of partial communities if
either one shares more than 30% of the members of the other.
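
The bookkeeping steps of this appendix (estimating K, thresholding the belonging coefficients, and merging highly overlapping partial communities within one ego network) can be sketched as follows. This is a minimal sketch and not the released code: the function names are hypothetical, the use of floor division for "1/30 of the network size" is an assumption, and the greedy pairwise merging order is one possible choice. The fuzzy detection step of Ball et al. itself is not shown.

```python
def estimate_k(ego_size, ratio=30, lower=5, upper=20):
    """Overestimated number of communities: size/30, clamped to [5, 20]."""
    return max(lower, min(upper, ego_size // ratio))  # floor is an assumption

def members_from_coefficients(coeffs, center, threshold=0.20):
    """coeffs maps vertex -> list of K belonging coefficients."""
    k = len(next(iter(coeffs.values())))
    communities = [set() for _ in range(k)]
    for vertex, row in coeffs.items():
        for c, value in enumerate(row):
            # The central vertex is a default member of every community.
            if vertex == center or value > threshold:
                communities[c].add(vertex)
    return communities

def merge_overlapping(communities, fraction=0.30):
    """Merge pairs in which either set shares > 30% of the other's members."""
    merged = [set(c) for c in communities if c]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                shared = len(merged[i] & merged[j])
                if shared > fraction * min(len(merged[i]), len(merged[j])):
                    merged[i] |= merged.pop(j)
                    changed = True
                    break
            if changed:
                break
    return merged

# The row for vertex j from the text; the ego row is set to zeros on purpose.
coeffs = {"ego": [0.0] * 5, "j": [0.64, 0.29, 0.0, 0.07, 0.0]}
comms = members_from_coefficients(coeffs, center="ego")
print(["j" in c for c in comms])
print(merge_overlapping([{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8, 9}]))
```

With the example row, vertex j is kept only in the first two communities (C1 and C2), while the central vertex appears in all five. In the merging example, the first two sets share two of their four members and are combined, while the third set is left alone.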

It is important to note that PCMA does not require a high
accuracy in this step of finding all the partial communities.
Any error generated in this step can be greatly reduced by the
merger and post-processing steps, as discussed in Sec. III C.